feat(ci): add HPC-grade CI/CD workflows, GPU hardware validation, and benchmark regression gate #6
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 27c24d6f85
```yaml
          pip install pytest pytest-benchmark
      - name: Run benchmark suite
        run: pytest tests -m "benchmark" --benchmark-json benchmark.json -q
```
Run benchmark suite without an empty marker filter
pytest's `-m` option only runs tests matching the mark expression (see `pytest --help`), and this repository currently has no `benchmark`-marked tests under tests/, so this command deselects everything and exits with code 5 ("no tests collected"). In this workflow context (PR and nightly), that causes the benchmark job to fail before the regression checker can run.
```yaml
        run: pytest -m "not gpu" --cov=kernels --cov=implementations --cov-fail-under=85
      - name: Integration tests
        run: pytest tests -m "integration and not gpu" -q
```
Remove HPC marker filters that select zero tests
The integration step uses `-m "integration and not gpu"`, but there are no matching marks in the current tests/ tree, so pytest deselects all tests and returns exit code 5; the same issue affects the fallback step in this workflow. As written, both matrix jobs can fail even when the codebase is otherwise healthy.
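Whichever marker names the workflows end up keeping, they need to be registered and actually applied to tests. A hedged sketch of registering them in a `conftest.py` (hypothetical file; marker names assumed to mirror the `-m` expressions in the workflows):

```python
# Hypothetical conftest.py: register the custom markers used by the
# workflows' -m filters so pytest recognizes them (and --strict-markers,
# if enabled, would not reject them as unknown).
def pytest_configure(config):
    # Marker names assumed from the -m expressions: "benchmark",
    # "integration and not gpu", "gpu or numerical".
    for mark in ("benchmark", "integration", "gpu", "numerical"):
        config.addinivalue_line("markers", f"{mark}: {mark}-suite tests")
```

The same registration can live under a `[tool.pytest.ini_options] markers` list instead; either way, `-m` selects zero tests until some test function carries one of these marks.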
```yaml
      - name: GPU integration and numerical tests
        run: |
          pytest tests -m "gpu or numerical" -q
```
Fix GPU test selection to avoid no-tests-collected failures
This GPU validation step filters on the `gpu`/`numerical` markers, but no such marks exist in the present test suite, so pytest exits with code 5 after deselecting everything. On scheduled or manually triggered self-hosted runs, the workflow will therefore fail at test selection rather than validating hardware behavior.
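If the `gpu`/`numerical` markers are kept, the suite also needs tests that carry them and that degrade to a skip rather than an error off-hardware. A sketch under those assumptions (hypothetical test file; marker name assumed; torch availability probed at collection time):

```python
# Hypothetical tests/test_gpu_sketch.py: a gpu-marked numerical check
# that skips, rather than fails, when no torch install is present, so
# the "gpu or numerical" selection still collects cleanly on any runner.
import importlib.util

import pytest

HAS_TORCH = importlib.util.find_spec("torch") is not None


@pytest.mark.gpu
@pytest.mark.skipif(not HAS_TORCH, reason="torch not installed")
def test_device_matmul():
    import torch

    # Fall back to CPU so the numerical assertion is still exercised
    # on runners without CUDA.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.ones(4, 4, device=device)
    # (ones @ ones) is a 4x4 matrix of 4s; its sum is 64.
    assert float((x @ x).sum()) == 64.0
```

On the self-hosted GPU runners this selects and runs; elsewhere it reports as skipped instead of failing the workflow at collection.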
Motivation
Description
- `docs/pipelines/HPC_CICD_ARCHITECTURE.md` describing the layered pipeline, deterministic build strategy, validation gates, rollback policy, and scaling guidance.
- `.github/workflows/hpc-matrix.yml` (expanded CPU/Torch & CUDA compatibility matrix), `.github/workflows/gpu-hardware.yml` (self-hosted GPU hardware validation + benchmark artifact upload), `.github/workflows/benchmark.yml` (benchmark regression enforcement), and `.github/workflows/docs.yml` (strict docs build).
- `scripts/ci/check_benchmark_regression.py` and a placeholder baseline `.ci/benchmarks/baseline.json`, plus `mkdocs.yml` and links from `README.md`/`docs/README.md` to surface the new doc.
- `CHANGELOG.md` to record the CI/CD additions, and committed the new files to the PR branch.

Testing
- Ran `bash scripts/smoke.sh` (project smoke tests), which completed successfully. ✅
- Ran `python -m compileall scripts/ci/check_benchmark_regression.py` and `python scripts/ci/check_benchmark_regression.py .ci/benchmarks/baseline.json 0.05`; both succeeded (the baseline is empty, so the gate is skipped). ✅
- `mkdocs` installation failed in this environment due to package index/proxy restrictions, so `mkdocs build --strict` could not be validated here.
- Ran `ruff check .`, which failed repository-wide due to pre-existing lint issues unrelated to the changes in this PR; these lint failures are external to the new CI artifacts and should be addressed separately. ❌
- GPU hardware validation was not exercised here; it targets self-hosted runners labeled `a100`, `h100`, and `rtx4090`.
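The regression gate exercised above can be sketched roughly as follows. This is a hypothetical reimplementation of the comparison `check_benchmark_regression.py` presumably performs (the real script and its input format are not shown in this excerpt); benchmark names map to mean times, and the third CLI argument (`0.05`) is assumed to be a relative slowdown tolerance:

```python
# Hedged sketch of a benchmark regression gate: compare each benchmark's
# current mean time against a committed baseline and report any that are
# more than `tolerance` slower. An empty baseline skips the gate, which
# matches the "baseline is empty so gate is skipped" behavior noted above.
def check_regressions(current: dict, baseline: dict, tolerance: float = 0.05):
    """Return names of benchmarks slower than baseline * (1 + tolerance)."""
    if not baseline:
        return []  # empty baseline: nothing to compare, gate is skipped
    regressions = []
    for name, base_mean in baseline.items():
        cur_mean = current.get(name)
        if cur_mean is not None and cur_mean > base_mean * (1 + tolerance):
            regressions.append(name)
    return regressions
```

For example, `check_regressions({"matmul": 1.2}, {"matmul": 1.0}, 0.05)` flags `matmul` (1.2 exceeds 1.0 × 1.05), and a CI wrapper would exit nonzero when the returned list is non-empty.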